Abstract

Several popular languages including Haskell and Python use the
indentation and layout of code as an essential part of their syntax.
In the past, implementations of these languages used ad hoc
techniques to implement layout. Recent work has shown that a simple
extension to context-free grammars can replace these ad hoc
techniques and provide both formal foundations and efficient parsing
algorithms for indentation sensitivity.

However, that previous work is limited to bottom-up, LR(k)
parsing, and many combinator-based parsing frameworks including 
Parsec use top-down algorithms that are outside its scope. This 
paper remedies this by showing how to add indentation sensitivity 
to parsing frameworks like Parsec. It explores both the formal 
semantics of and efficient algorithms for indentation sensitivity. 
It derives a Parsec-based library for indentation-sensitive parsing 
and presents benchmarks on a real-world language that show its 
efficiency and practicality. 

Categories and Subject Descriptors D.3.1 [Programming
Languages]: Formal Definitions and Theory—Syntax; D.3.4
[Programming Languages]: Processors—Parsing; F.4.2 [Mathematical
Logic and Formal Languages]: Grammars and Other Rewriting
Systems—Parsing

General Terms Algorithms, Languages 

Keywords Parsing; Parsec; Indentation sensitivity; Layout;
Offside rule

1. Introduction 

Languages such as Haskell (Marlow (ed.) 2010) and Python
(Python) use the indentation of code to delimit various grammatical
forms. For example, in Haskell, the contents of a let, where,
do, or case expression can be indented relative to the surrounding
code instead of being explicitly delimited by curly braces. One
may write:

mapAccumR f = loop
  where loop acc (x:xs) = (acc'', x' : xs')
          where (acc'', x') = f acc' x
                (acc', xs') = loop acc xs
        loop acc [] = (acc, [])
The indentation of the bindings after each where keyword
determines the structure of this code. For example, the indentation of
the last line determines that it is part of the bindings introduced by
the first where instead of the second where.

While Haskell and Python are well known for being indentation
sensitive, a large number of other languages also use indentation.
These include ISWIM (Landin 1966), Occam (INMOS Limited 1984),
Orwell (Wadler 1985), Miranda (Turner 1989), SRFI-49
(Möller 2005), Curry (Hanus (ed.) 2006), YAML (Ben-Kiki et al.
2009), Habit (HASP Project 2010), F# (Syme et al. 2010), Markdown
(Gruber), reStructuredText (Goodger 2012), and Idris (Brady
2013a). Unfortunately, implementations of these languages often
use ad hoc techniques to implement indentation. Even the language
specifications themselves describe indentation informally or with
formalisms that are not suitable for implementation.

Previous work on indentation sensitivity (Adams 2013)
demonstrated a grammar formalism for expressing layout rules that is an
extension of context-free grammars and is both theoretically sound
and practical to implement in terms of bottom-up, LR(k) parsing.
However, Parsec (Leijen and Martini 2012), like many
combinator-based libraries, does not use the LR(k) algorithm. It is top-down
instead of bottom-up and thus is outside the scope of that work. This
paper extends that work to encompass such systems. We show that
this extension both has a solid theoretical foundation and is
practical to implement. The resulting indentation-sensitive grammars are
easy and convenient to write, and fast, efficient parsers can be easily
implemented for them. Our implementation of these techniques is
available as the indentation package on the Hackage repository.
The organization and contributions of this paper are as follows. 

- In Section 2 we review parsing expression grammars (PEG)
and give an informal description of a grammar formalism for
expressing indentation sensitivity.

- In Section 3 we demonstrate the expressivity of this formalism
by reviewing the layout rules of Haskell and Python and then
showing how to express them in terms of this grammar
formalism.

- In Section 4 we formalize the semantics of PEG and define an
indentation-sensitive, PEG-based semantics for this grammar
formalism.

- In Section 5 we examine the internals of Parsec, show the
correspondence between it and PEG, and demonstrate how to
implement indentation sensitivity in Parsec.

- In Section 6 we benchmark our implementation on a real-world
language, and we show it to be practical, effective, and efficient
at defining layout rules.

- In Section 7 we review related work and other implementations
of indentation sensitivity.

- In Section 8 we conclude.
Figure 1. Syntax of PEG parsing expressions


2. The Basic Idea 

2.1 Parsing Expression Grammars 

The basic idea for indentation sensitivity is the same as in Adams
(2013) except that we aim to implement it for the top-down,
combinator-based parsing algorithms used in Parsec. In order to do
this, we base our semantics on parsing expression grammars (PEG)
instead of context-free grammars (CFG) as they more closely align
with the algorithms used by Parsec. In Section 4.1 we review the
formal semantics of PEG, but at a basic level, the intuition behind
PEG is simple. As in a CFG, there are terminals and non-terminals.
However, in a CFG, each non-terminal corresponds to several
productions that each map the non-terminal to a sequence of terminals
and non-terminals. In a PEG, on the other hand, each non-terminal
corresponds to a single parsing expression. Where in a CFG we
might have the productions $A \to \texttt{'a'}\, A$ and
$A \to \texttt{'b'}$, in PEG we have the single production
$A \to (\texttt{'a'} ; A) \;\langle|\rangle\; \texttt{'b'}$.

The syntax of these parsing expressions is defined as shown
in Figure 1 where $p$, $p_1$, and $p_2$ are parsing expressions. These
operators behave as one would expect with minor adjustments for
the choice and repetition operators. These two are special in that
they are biased. The choice operator is left biased and attempts $p_2$
only if $p_1$ fails. Likewise, the repetition operator is greedy and,
when possible, matches more rather than fewer repetitions. These
biases ensure the uniqueness of the parse result, and thus PEG
avoids the ambiguity problems that can arise with a CFG.

A number of other operators exist in PEG including optional 
terms, non-empty repetition (i.e., Kleene plus), positive lookahead, 
and a fail operator, but those operators are derived forms that are 
not needed in this paper. 

2.2 Indentation Sensitivity 

In order to support indentation-sensitive parsing, we first modify 
the usual notion of parsing by annotating every token in the input 
with the column at which it occurs in the source code. We call this 
its indentation and write $a^i$ for a token $a$ at indentation $i$.

During parsing we annotate each sub-tree of the parse tree with
an indentation as in Figure 2. These annotations coincide with the
intuitive notion of how far a block of code is indented. Thus, the
sub-tree rooted at $A^5$ is a block indented to column 5. We then place
constraints on how the indentations of sub-trees relate to those of
their parents. This is formally achieved by introducing an operator
$p^{\triangleright}$ that specifies that the indentation of a tree parsed
by $p$ must have the relation $\triangleright$ relative to that of its
parent where $\triangleright$ is a given numeric relation. For example,
we write $p^{>}$ to specify that a tree parsed by $p$ must have a
strictly greater indentation than its parent. In all other
places, parent and child must have identical indentations. Note that
the indentation of a sub-tree does not directly affect the indentation
of its tokens. Rather, it imposes restrictions on the indentations
of its immediate children, which then impose restrictions on their
children and so on until we get to tokens. At any point, these
restrictions can be locally changed by the $p^{\triangleright}$ operator.

As a simple example, we may write
$A \to \texttt{'('} ; A^{>} ; \texttt{')'}$ to
mean that ( and ) must be at the same indentation as the $A$ on
the left of the production arrow, but the $A$ on the right must be at
a greater indentation. We may also write
$A \to \texttt{'['}^{\geq} ; A^{>} ; \texttt{']'}^{\geq}$
to mean the same except that [ and ] must be at an indentation
greater than or equal to the indentation of the $A$ on the left of the
production arrow. In addition, we may write $A \to B^{*}$ to mean that
the indentation of each $B$ must be equal to that of $A$.

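If we combine these, we can get a grammar for indented
parentheses and square brackets as follows.

\[
A \to \left( \texttt{'('} ; A^{>} ; \texttt{')'} \;\langle|\rangle\; \texttt{'['}^{\geq} ; A^{>} ; \texttt{']'}^{\geq} \right)^{*}
\]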

In that grammar, matching parentheses must align vertically, and
things enclosed in parentheses must be indented more than the
parentheses. Things enclosed in square brackets merely must be
indented more than the surrounding code. Figure 2 shows examples
of parse trees for this grammar on the words
$\texttt{(}^1 \texttt{[}^4 \texttt{(}^5 \texttt{)}^5 \texttt{]}^7 \texttt{)}^1$
and
$\texttt{(}^1 \texttt{[}^8 \texttt{(}^6 \texttt{)}^6 \texttt{[}^8 \texttt{]}^9 \texttt{]}^4 \texttt{(}^3 \texttt{)}^3 \texttt{)}^1$.
In these parse trees, note how the
indentations of the non-terminals and terminals relate to each other
according to the indentation relations specified in the grammar.

While in principle any set of indentation relations can be used,
we restrict ourselves to the relations $=$, $>$, $\geq$, and
$\circledast$ as these cover the indentation rules of most languages.
The $=$, $>$, and $\geq$ relations have their usual meanings. The
$\circledast$ relation is $\{(i, j) \mid i, j \in \mathbb{N}\}$ and
disassociates the indentation of a child from that of its parent.
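
To make these relations concrete, the following is a minimal sketch in Haskell, not taken from the library described in this paper. The names IndentRel, Tree, holds, and wellIndented are assumptions introduced only for illustration. Each node records the relation it must satisfy relative to its parent, and the check walks the tree comparing every child against its parent.

```haskell
-- The four indentation relations =, >, >=, and "any" (hypothetical names).
data IndentRel = REq | RGt | RGe | RAny
  deriving (Show, Eq)

-- holds rel child parent: does the child's indentation stand in
-- relation rel to the parent's indentation?
holds :: IndentRel -> Int -> Int -> Bool
holds REq  c p = c == p
holds RGt  c p = c > p
holds RGe  c p = c >= p
holds RAny _ _ = True

-- A parse tree annotated with indentations (an illustrative encoding).
data Tree
  = Node IndentRel Int [Tree]  -- relation to parent, indentation, children
  | Token IndentRel Int        -- relation to parent, column of the token

relOf :: Tree -> IndentRel
relOf (Node r _ _) = r
relOf (Token r _)  = r

indentOf :: Tree -> Int
indentOf (Node _ i _) = i
indentOf (Token _ i)  = i

-- Every child must satisfy its relation against its immediate parent.
wellIndented :: Tree -> Bool
wellIndented (Token _ _) = True
wellIndented (Node _ i kids) =
  all (\k -> holds (relOf k) (indentOf k) i && wellIndented k) kids

-- A tree for "( ( ) )" under A -> '(' ; A^> ; ')' with the outer
-- parentheses at column 1 and the inner ones at column 3.
good :: Tree
good = Node REq 1
  [ Token REq 1
  , Node RGt 3 [Token REq 3, Node RGt 4 [], Token REq 3]
  , Token REq 1 ]

-- The same tree with the final closing parenthesis moved to column 2.
bad :: Tree
bad = Node REq 1
  [ Token REq 1
  , Node RGt 3 [Token REq 3, Node RGt 4 [], Token REq 3]
  , Token REq 2 ]
```

Under these assumptions, wellIndented accepts good and rejects bad, since the closing parenthesis no longer has the same indentation as its parent.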

Finally, indentation-sensitive languages typically have forms
where the first token of a subexpression determines the
indentation of the rest of the subexpression. For example, in Haskell the
branches of a case must all align and have their initial tokens at
the same indentation as each other. To handle this, we introduce the
$|p|$ operator, which behaves identically to $p$ except that its
indentation is always equal to the indentation of the first token of $p$. In the
context of a CFG, this operator can be defined as mere syntactic
sugar (Adams 2013). However, PEG's lookahead operator makes
this difficult to specify as a desugaring. Thus we introduce it as a
first-class operator and formally specify its behavior in Section 4.2.

3. Indentation-Sensitive Languages 

Despite the simplicity of this framework for indentation sensitivity,
it can express a wide array of layout rules. We demonstrate this by
reviewing the layout rules for Haskell and Python and then
showing how they can be expressed as indentation-sensitive grammars.
Though not shown here, sketches for other indentation-sensitive
languages have been constructed for ISWIM, Miranda, occam,
Orwell, Curry, Habit, Idris, and SRFI-49. Those already familiar with
the techniques in Adams (2013) can safely skip this section.

3.1 Haskell 

3.1.1 Language 

In Haskell, indentation-sensitive blocks (e.g., the bodies of do,
case, or where expressions) are made up of one or more
statements or clauses that not only are indented relative to the
surrounding code but also are indented to the same column as each other.
Thus, lines that are more indented than the block continue the
current clause, lines that are at the same indentation as the block start
a new clause, and lines that are less indented than the block are
not part of the block. In addition, semicolons (;) and curly braces
({ and }) can explicitly separate clauses and delimit blocks,
respectively. Explicitly delimited blocks are exempt from indentation
restrictions arising from the surrounding code.
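
As a small illustration of these rules, the following sketch defines the same function twice. The function names are hypothetical and chosen only for this example. The first definition relies on layout, and the second uses explicit braces and semicolons, so its alternatives are free to start at unrelated columns.

```haskell
-- Layout form: the alternatives start at the same column, and the
-- more-indented line continues the alternative that precedes it.
classify :: Int -> String
classify n = case n of
  0 -> "zero"
  1 ->
    "one"
  _ -> "many"

-- Explicit form: braces and semicolons delimit the block, so the
-- alternatives and the closing brace need not align with each other.
classify' :: Int -> String
classify' n = case n of {
    0 -> "zero";
  1 -> "one";
      _ -> "many"
  }
```

Both definitions denote the same function. Only the way the block and its clauses are delimited differs.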

While the indentation rules of Haskell are intuitive to use in
practice, the way that they are formally expressed in the Haskell
language specification (Marlow (ed.) 2010, §10.3) is not nearly so
intuitive. The indentation rules are specified in terms of both the
lexer and an extra pass between the lexer and the parser. Roughly
speaking, the lexer inserts special {n} tokens where a new block
might start and special <n> tokens where a new clause within a
block might start. The extra pass then translates these tokens into
explicit semicolons and curly braces.

The special tokens are inserted according to the following rules. 

- If a let, where, do, or of keyword is not followed by the
lexeme {, the token {n} is inserted after the keyword, where n
is the indentation of the next lexeme if there is one, or 0 if the
end of file has been reached.

- If the first lexeme of a module is not { or module, then it is
preceded by {n} where n is the indentation of the lexeme.

- Where the start of a lexeme is preceded only by white space
on the same line, this lexeme is preceded by <n>, where n
is the indentation of the lexeme, provided that it is not, as a
consequence of the first two rules, preceded by {n}. (Marlow
(ed.) 2010, §10.3)

Between the lexer and the parser, an indentation resolution pass
converts the lexeme stream into a stream that uses explicit
semicolons and curly braces to delimit clauses and blocks. The stream of
tokens from this pass is defined to be L tokens [] where tokens
is the stream of tokens from the lexer and L is the function in
Figure 3. Thus the context-free grammar only has to deal with
semicolons and curly braces. It does not deal with layout.

This L function is fairly intricate, but the key clauses are the
ones dealing with <n> and {n}. After a let, where, do, or of
keyword, the lexer inserts a {n} token. If n is a greater indentation
than the current indentation, then the first clause for {n} executes,
an open brace ({) is inserted, and the indentation n is pushed on
the second argument to L (i.e., the stack of indentations). If a line
starts at the same indentation as the top of the stack, then the first
clause for <n> executes, and a semicolon (;) is inserted to start
a new clause. If it starts at a smaller indentation, then the second
clause for <n> executes, and a close brace (}) is inserted to close
the block started by the inserted open brace. Finally, if the line is at
a greater indentation, then the third clause executes, no extra token
is inserted, and the line is a continuation of the current clause. The


1 The additional indentation relation $\{(i + 2, i) \mid i \in \mathbb{N}\}$ is required by
Occam as it has forms that require increasing indentation by exactly 2.


L (<n>:ts) (m:ms) = ';' : (L ts (m:ms))          if m = n
                  = '}' : (L (<n>:ts) ms)        if n < m
L (<n>:ts) ms     = L ts ms
L ({n}:ts) (m:ms) = '{' : (L ts (n:m:ms))        if n > m
L ({n}:ts) []     = '{' : (L ts [n])             if n > 0
L ({n}:ts) ms     = '{' : '}' : (L (<n>:ts) ms)
L ('}':ts) (0:ms) = '}' : (L ts ms)
L ('}':ts) ms     = parse-error
L ('{':ts) ms     = '{' : (L ts (0:ms))
L (t:ts)   (m:ms) = '}' : (L (t:ts) ms)          if m ≠ 0 and parse-error(t)
L (t:ts)   ms     = t : (L ts ms)
L []       []     = []
L []       (m:ms) = '}' : L [] ms                if m ≠ 0


Figure 3. Haskell's L function (Marlow (ed.) 2010, §10.3)


effect of all this is that {, ;, and } tokens are inserted wherever
layout indicates that blocks start, new clauses begin, or blocks end,
respectively. The other clauses in L handle a variety of other edge
cases and scenarios.
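
The clauses of L translate almost directly into Haskell. The following is a sketch under stated assumptions rather than the code of any Haskell implementation: the Tok type and the name layout are hypothetical, output tokens are plain strings, and the clause that consults parse-error(t) is omitted because it needs an oracle from the parser. An unclosed explicit brace at end of input, for which L has no clause, is reported as an error here.

```haskell
-- Tokens as produced by the lexer (an illustrative encoding).
data Tok
  = Angle Int   -- <n>: first lexeme on a line, at indentation n
  | Brace Int   -- {n}: a block may start here, at indentation n
  | Lex String  -- any ordinary lexeme, including explicit "{", "}", ";"
  deriving (Show, Eq)

-- layout ts ms mirrors L ts ms; Left signals parse-error.
layout :: [Tok] -> [Int] -> Either String [String]
layout (Angle n : ts) (m : ms)
  | m == n = (";" :) <$> layout ts (m : ms)
  | n < m  = ("}" :) <$> layout (Angle n : ts) ms
layout (Angle _ : ts) ms = layout ts ms
layout (Brace n : ts) (m : ms)
  | n > m = ("{" :) <$> layout ts (n : m : ms)
layout (Brace n : ts) []
  | n > 0 = ("{" :) <$> layout ts [n]
layout (Brace n : ts) ms = (["{", "}"] ++) <$> layout (Angle n : ts) ms
layout (Lex "}" : ts) (0 : ms) = ("}" :) <$> layout ts ms
layout (Lex "}" : _) _ = Left "parse-error"
layout (Lex "{" : ts) ms = ("{" :) <$> layout ts (0 : ms)
-- The parse-error(t) clause would go here; it is omitted in this sketch.
layout (Lex t : ts) ms = (t :) <$> layout ts ms
layout [] [] = Right []
layout [] (m : ms)
  | m /= 0    = ("}" :) <$> layout [] ms
  | otherwise = Left "parse-error"  -- assumption: unclosed explicit brace

-- Tokens for a case expression whose two alternatives start at column 3.
caseTokens :: [Tok]
caseTokens =
  [ Lex "case", Lex "e", Lex "of", Brace 3
  , Lex "A", Lex "->", Lex "1"
  , Angle 3, Lex "B", Lex "->", Lex "2" ]
```

With an empty indentation stack, layout caseTokens [] inserts an open brace after of, a semicolon before the second alternative, and a close brace at the end of the input.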

Note that L uses parse-error to signal a parse error but uses
parse-error(t) as an oracle that predicts the future behavior of
the parser that runs after L. Specifically,

    if the tokens generated so far by L together with the next
    token t represent an invalid prefix of the Haskell grammar,
    and the tokens generated so far by L followed by the token
    "}" represent a valid prefix of the Haskell grammar, then
    parse-error(t) is true. (Marlow (ed.) 2010, §10.3)

This handles code such as 
let x = do f; g in x 

where the block starting after the do needs to be terminated before
the in. This requires knowledge about the parse structure in order
to be handled properly, and thus parse-error(t) is used to query
the parser for this information.

In addition to the operational nature of this definition, the use of
the parse-error(t) predicate means that L cannot run as an
independent pass; its execution must interact with the parser. In fact, the
Haskell implementations GHC (GHC 2011) and Hugs (Jones 1994)
do not use a separate pass for L. Instead, the lexer and parser share
state consisting of a stack of indentations. The parser accounts for
the behavior of parse-error(t) by making close braces optional
in the grammar and appropriately adjusting the indentation stack
when braces are omitted. The protocol relies on "some mildly
complicated interactions between the lexer and parser" (Jones 1994)
and is tricky to use. Even minor changes to the error propagation
of the parser can affect whether syntactically correct programs are
accepted. While we may believe in the correctness of these parsers
based on their many years of use and testing, the significant and
fundamental structural differences between the language
specification and these implementations are troubling.

3.1.2 Grammar 

While the specification of Haskell's layout rule is complicated, it
can be easily and intuitively specified using our indentation
operators. By using these operators there is no need for an intermediate L
function, and the lexer and parser can be cleanly separated into
self-contained passes. The functionality of parse-error(t) is simply
implicit in the structure of the grammar.

For example, Figure 4 shows productions that specify the case
form and its indentation rules. With regard to terminals, we
annotate most of them with an indentation relation of $>$ in order to allow
them to appear at any column greater than the current indentation.


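\[
\begin{array}{lcl}
\textit{case}  & \to & \texttt{'case'}^{>} ; \textit{exp} ; \texttt{'of'}^{>} ; (\textit{eAlts} \;\langle|\rangle\; \textit{iAlts}) \\
\textit{eAlts} & \to & \texttt{'\{'}^{>} ; \textit{alts}^{\circledast} ; \texttt{'\}'}^{\circledast} \\
\textit{iAlts} & \to & (|\textit{alts}|^{*})^{>} \\
\textit{alts}  & \to & (\textit{alt'} \;\langle|\rangle\; \textit{alt}) ; \textit{alt'}^{*} \\
\textit{alt'}  & \to & \texttt{';'}^{>} ; (\textit{alt} \;\langle|\rangle\; \varepsilon)
\end{array}
\]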


Figure 4. Productions for Haskell’s case form 


We use $>$ instead of $\geq$ because Haskell distinguishes tokens that
are at an indentation equal to the current indentation from tokens
that are at a strictly greater indentation. The former start a new
clause while the latter continue the current clause. An exception
to this rule is the closing curly brace (}) of an explicitly delimited
block. Haskell's indentation rule allows it to appear at any column.
Thus, eAlts annotates it with $\circledast$ instead of the usual $>$.

In Haskell, a block can be delimited by either explicit curly
braces or use of the layout rule. In Figure 4 this is reflected by
the two non-terminals eAlts and iAlts. The former expands to
$\texttt{'\{'}^{>} ; \textit{alts}^{\circledast} ; \texttt{'\}'}^{\circledast}$
where alts is a non-terminal parsing a
semicolon-separated sequence of case alternatives. The $\circledast$ relation
allows alts to not respect the indentation of the surrounding code.

The other non-terminal, iAlts, expands to $(|\textit{alts}|^{*})^{>}$. The $>$
relation increases the indentation, and the repetition operator allows
zero or more $|\textit{alts}|$ to be parsed. Due to the $>$ relation, these may
be at any indentation greater than the current indentation, but they
still must be at the same indentation as each other as they are all
children of the same parsing expression, $|\textit{alts}|^{*}$. The use of
$|\textit{alts}|$ instead of alts ensures that the first tokens of the alts are all
at the same indentation as the $|\textit{alts}|$ itself. Thus the alternatives
in a case expression all align to the same column as each other.
Note that because iAlts refers to alts instead of alt, we have
the option of using semicolons (;) to separate clauses in addition to
using layout. When using curly braces to explicitly delimit a block,
semicolons must always be used.

Haskell has a side condition requiring every case to contain at
least one alt. It cannot contain just a sequence of semicolons (;).
This can be implemented either as a check after parsing or by
splitting alts and $|\textit{alts}|^{*}$ into different forms depending on whether
an alt has been parsed.

Other grammatical forms that use the layout rule follow the 
same general pattern as case with only minor variation to account 
for differing base cases (e.g., let uses decl in place of alt) and 
structures (e.g., a do block is a sequence of stmt ending in an exp). 

Finally, GHC also supports an alternative indentation rule that
is enabled by the RelaxedLayout extension. It allows opening
braces to be at any column regardless of the current
indentation (GHC 2011, §1.5.2). This is easily implemented by changing
eAlts to be:

\[
\textit{eAlts} \to \texttt{'\{'}^{\circledast} ; \textit{alts}^{\circledast} ; \texttt{'\}'}^{\circledast}
\]

3.2 Python 
3.2.1 Language 

Python represents a different approach to specifying indentation 
sensitivity. It is explicitly line oriented and features NEWLINE in its
grammar as a terminal that separates statements. The grammar uses 
INDENT and DEDENT tokens to delimit indentation-sensitive forms. 
An INDENT token is emitted by the lexer whenever the start of a line 
is at a strictly greater indentation than the previous line. Matching 
DEDENT tokens are emitted when a line starts at a lesser indentation. 
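
The INDENT and DEDENT rule can be sketched with a stack of indentations. The following Haskell sketch is an illustration only and is not Python's actual tokenizer: the PyTok type and the name indentTokens are assumptions, each input line is given as its indentation paired with its text, and line joining is ignored entirely.

```haskell
-- Output tokens of the sketch (hypothetical names).
data PyTok = Indent | Dedent | Line String
  deriving (Show, Eq)

-- Turn (indentation, text) pairs into a token stream with INDENT and
-- DEDENT markers. The stack holds the indentations of the open blocks,
-- innermost first, and always ends with 0.
indentTokens :: [(Int, String)] -> Either String [PyTok]
indentTokens = go [0]
  where
    go :: [Int] -> [(Int, String)] -> Either String [PyTok]
    -- At end of input, close every block that is still open.
    go stack [] = Right (replicate (length stack - 1) Dedent)
    go stack ((n, txt) : rest) =
      case span (> n) stack of
        (closed, stack'@(top : _))
          -- Same indentation as an enclosing block: close the deeper ones.
          | n == top    -> emit (map (const Dedent) closed) stack'
          -- Deeper than the current block: open a new block.
          | null closed -> emit [Indent] (n : stack')
        -- Dedent to a column that matches no enclosing block.
        _ -> Left "inconsistent dedent"
      where
        emit pre st = ((pre ++ [Line txt]) ++) <$> go st rest
```

For example, a header at column 0 followed by two lines at column 4 and a final line at column 0 yields one Indent before the first indented line and one Dedent before the final line.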

In Python, indentation is used only to delimit statements, and
there are no indentation-sensitive forms for expressions. This,
combined with the simple layout rules, would seem to make parsing
Python much simpler than for Haskell, but Python has line joining
rules that complicate matters.

Normally, each new line of Python code starts a new statement. 
If, however, the preceding line ends in a backslash (\), then the 
current line is “joined” with the preceding line and is a continuation 
of the preceding line. In addition, tokens on this line are treated as 
if they had the same indentation as the backslash itself. 

Python's explicit line joining rule is simple enough to
implement directly in the lexer, but Python also has an implicit line
joining rule. Specifically, expressions

    in parentheses, square brackets or curly braces can be split
    over more than one physical line without using backslashes.
    ... The indentation of the continuation lines is not important.
    (Python, §2.1.6)

This means that INDENT and DEDENT tokens must not be emitted by 
the lexer between paired delimiters. For example, the second line of 
the following code should not emit an INDENT, and the indentation 
of the third line should be compared to the indentation of the first 
line instead of the second line. 

x = [
  y ]
z = 3

Thus, while the simplicity of Python's indentation rules is
attractive, they contain hidden complexity that requires interleaving the
execution of the lexer and parser.

3.2.2 Grammar 

Though Python’s specification presents its indentation rules quite 
differently from Haskell’s specification, once we translate it to use 
our indentation operators, it shares many similarities with that of 
Haskell. The lexer still needs to produce NEWLINE tokens, but it 
does not produce INDENT or DEDENT tokens. As with Haskell, we 
annotate terminals with the default indentation relation >. 

In Python, the only form that changes indentation is the suite 
non-terminal, which represents a block of statements contained 
inside a compound statement. For example, one of the productions 
for while is: 

\[
\textit{while\_stmt} \to \texttt{'while'}^{>} ; \textit{test} ; \texttt{':'}^{>} ; \textit{suite}
\]

A suite has two forms. The first is for multi-line statements, and 
the second is for single-line statements that are not delimited by 
indentation. The following productions handle both of these cases. 

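\[
\begin{array}{lcl}
\textit{suite} & \to & \textit{NEWLINE}^{>} ; \textit{block}^{>} \\
               & \langle|\rangle & \textit{stmt\_list} ; \textit{NEWLINE}^{>} \\
\textit{block} & \to & |\textit{statement}|^{*}
\end{array}
\]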

When a suite is of the indentation-sensitive, multi-line form (i.e.,
using the left-hand side of the choice), the initial NEWLINE token
ensures that the suite is on a separate line from the preceding
header. The block inside a suite must then be at some
indentation greater than the current indentation. Such a block is a sequence
of statement forms that all start with their first token at the same
column. In Python's grammar, the productions for statement
already include a terminating NEWLINE, so NEWLINE is not needed in
the productions for block.

Finally, for implicit line joining, we employ the same trick as for
braces in Haskell. For any form that contains parentheses, square
brackets, or curly braces, we annotate the part contained in the
delimiters with the $\circledast$ indentation relation. Since the final delimiter
is also allowed to appear at any column, we annotate it with $\circledast$. For
example, one of the productions for list construction becomes:

\[
\textit{atom} \to \texttt{'['}^{>} ; \textit{listmaker}^{\circledast} ; \texttt{']'}^{\circledast}
\]
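\[
\begin{array}{ll}
\textbf{Empty string} & \\
(\varepsilon, w) \Rightarrow (1, \top(\varepsilon)) & \\[1ex]
\textbf{Terminal} & \\
(a, aw) \Rightarrow (1, \top(a)) & \\
(a, bw) \Rightarrow (1, \bot) & \text{if } a \neq b \\
(a, \varepsilon) \Rightarrow (1, \bot) & \\[1ex]
\textbf{Non-terminal} & \\
(A, w) \Rightarrow (n + 1, o) & \text{if } (\delta(A), w) \Rightarrow (n, o) \\[1ex]
\textbf{Sequence} & \\
(p_1 ; p_2, w_1 w_2 u) \Rightarrow (n_1 + n_2 + 1, \top(w_1 w_2)) & \text{if } (p_1, w_1 w_2 u) \Rightarrow (n_1, \top(w_1)) \\
 & \text{and } (p_2, w_2 u) \Rightarrow (n_2, \top(w_2)) \\
(p_1 ; p_2, w_1 w_2 u) \Rightarrow (n_1 + 1, \bot) & \text{if } (p_1, w_1 w_2 u) \Rightarrow (n_1, \bot) \\
(p_1 ; p_2, w_1 w_2 u) \Rightarrow (n_1 + n_2 + 1, \bot) & \text{if } (p_1, w_1 w_2 u) \Rightarrow (n_1, \top(w_1)) \\
 & \text{and } (p_2, w_2 u) \Rightarrow (n_2, \bot) \\[1ex]
\textbf{Lookahead} & \\
(!p, wu) \Rightarrow (n + 1, \top(\varepsilon)) & \text{if } (p, wu) \Rightarrow (n, \bot) \\
(!p, wu) \Rightarrow (n + 1, \bot) & \text{if } (p, wu) \Rightarrow (n, \top(w)) \\[1ex]
\textbf{Choice} & \\
(p_1 \,\langle|\rangle\, p_2, wu) \Rightarrow (n_1 + 1, \top(w)) & \text{if } (p_1, wu) \Rightarrow (n_1, \top(w)) \\
(p_1 \,\langle|\rangle\, p_2, wu) \Rightarrow (n_2 + 1, o) & \text{if } (p_1, wu) \Rightarrow (n_1, \bot) \\
 & \text{and } (p_2, wu) \Rightarrow (n_2, o) \\[1ex]
\textbf{Repetition} & \\
(p^{*}, w_1 w_2 u) \Rightarrow (n_1 + n_2 + 1, \top(w_1 w_2)) & \text{if } (p, w_1 w_2 u) \Rightarrow (n_1, \top(w_1)) \\
 & \text{and } (p^{*}, w_2 u) \Rightarrow (n_2, \top(w_2)) \\
(p^{*}, w_1 w_2 u) \Rightarrow (n + 1, \top(\varepsilon)) & \text{if } (p, w_1 w_2 u) \Rightarrow (n, \bot)
\end{array}
\]


Figure 5. Semantics of PEG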


4. Parsing Expression Grammars 

In order to formalize our notion of indentation-sensitive parsing,
we first review the formal semantics of PEG before extending it
to support indentation sensitivity. In Section 5 we show how this
semantics corresponds to and is implemented in Parsec.

4.1 Parsing Expression Grammars 

Parsing expression grammars (PEG) are a modern recasting of
top-down parsing languages (TDPL) (Aho and Ullman 1972) that has
recently become quite popular and has a large number of
implementations. Aside from the fact that PEG uses parsing expressions
instead of productions, the main difference between PEG and CFG
is that all choices are biased so there is only ever one possible result
for an intermediate parse. For example, the choice operator, $\langle|\rangle$, is
left biased. Ambiguous parses are thus impossible by construction.

From a practical perspective, this model makes it easy to
implement PEG as a top-down parser where each terminal translates to
a primitive, each non-terminal translates to a function, and the
sequencing operator translates to sequencing in the code. In addition,
the backtracking logic is relatively easy to implement. A choice
operator first attempts to parse its left-hand side. Only if that fails
does it backtrack and attempt to parse its right-hand side.

As formally defined by Ford (2004), a parsing expression
grammar, $G$, is a four-tuple $G = (N, \Sigma, \delta, S)$ where $N$ is a finite set of
non-terminal symbols, $\Sigma$ is a finite set of terminal symbols, $\delta$ is a
finite production relation, and $S \in N$ is the start symbol. This much
is identical to the traditional definition of a context-free grammar.
The difference comes in how $\delta$ is defined. It is a mapping from a
non-terminal symbol to a parsing expression, and we write $A \to p$
if $\delta$ maps $A$ to $p$. Unlike in CFG, there is only one $p$ to which a
given $A$ maps, and thus we write $\delta(A)$ to denote that parsing
expression.

The formal semantics for the operators in a parsing expression
are given in terms of a rewrite relation from a pair, $(p, w)$, of the
parsing expression, $p$, and an input word, $w$, to a pair, $(n, o)$, of a
step counter, $n$, and a result, $o$. The result $o$ is either the portion of
$w$ that is consumed by a successful parse or, in the case of failure,
the distinguished symbol $\bot$. For the sake of clarity, when $o$ is not
$\bot$, we write it as $\top(w)$ where $w$ is the parsed word. This rewrite
relation is defined inductively as shown in Figure 5. Note that while
the step counter is used to complete inductive proofs about PEG, it
is not needed by the parsing process and can usually be ignored.

The intuition behind these rules is fairly straightforward. The
empty parsing expression, $\varepsilon$, succeeds on any input in one step.
A terminal parsing expression succeeds on an input where the next
token is the terminal that the parsing expression expects and fails
otherwise. A non-terminal runs the parsing expression associated
with that non-terminal. Sequencing succeeds and consumes $w_1 w_2$
if the first parsing expression, $p_1$, consumes $w_1$ on input $w_1 w_2 u$
and the second parsing expression, $p_2$, consumes $w_2$ on input
$w_2 u$. Lookahead succeeds only if $p$ fails and fails otherwise. The
choice form is one of the characteristic features of PEG and is left
biased. If $p_1$ successfully consumes $w$ on input $wu$, then the choice
operator also succeeds by consuming $w$ on input $wu$. Otherwise,
if $p_1$ fails, then $p_2$ is run. The repetition operator is greedy. If $p$
successfully consumes $w_1$ on input $w_1 w_2 u$ and $p^{*}$ successfully
consumes $w_2$ on input $w_2 u$, then $p^{*}$ consumes $w_1 w_2$ on input
$w_1 w_2 u$. Otherwise, if $p$ fails, then $p^{*}$ succeeds while consuming
no input.
4.2 Indentation Sensitivity 

In order to add indentation sensitivity to the semantics of PEG, we
need to pass information about layout to each parse. While it is
tempting to think that this would just be the value of the current
indentation, that is not sufficient. For example, suppose we are
parsing the iAlts of a case expression and the case expression
is at indentation 1. The body of that iAlts is allowed at any
indentation greater than 1, but we do not know which indentation
greater than 1 to use until iAlts consumes its first token. So,
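\[
\begin{array}{ll}
\textbf{Empty string} & \\
(\varepsilon, w, I, f) \Rightarrow (1, \top^{f}_{I}(\varepsilon)) & \\[1ex]
\textbf{Terminal} & \\
(a, a^{i} w, I, f) \Rightarrow (1, \top^{\nparallel}_{\{i\}}(a)) & \text{if } i \in I \\
(a, b^{i} w, I, f) \Rightarrow (1, \bot) & \text{if } a \neq b \text{ or } i \notin I \\
(a, \varepsilon, I, f) \Rightarrow (1, \bot) & \\[1ex]
\textbf{Non-terminal} & \\
(A, w, I, f) \Rightarrow (n + 1, o) & \text{if } (\delta(A), w, I, f) \Rightarrow (n, o) \\[1ex]
\textbf{Sequence} & \\
(p_1 ; p_2, w_1 w_2 u, I, f) \Rightarrow (n_1 + n_2 + 1, \top^{h}_{K}(w_1 w_2)) & \text{if } (p_1, w_1 w_2 u, I, f) \Rightarrow (n_1, \top^{g}_{J}(w_1)) \\
 & \text{and } (p_2, w_2 u, J, g) \Rightarrow (n_2, \top^{h}_{K}(w_2)) \\
(p_1 ; p_2, w_1 w_2 u, I, f) \Rightarrow (n_1 + 1, \bot) & \text{if } (p_1, w_1 w_2 u, I, f) \Rightarrow (n_1, \bot) \\
(p_1 ; p_2, w_1 w_2 u, I, f) \Rightarrow (n_1 + n_2 + 1, \bot) & \text{if } (p_1, w_1 w_2 u, I, f) \Rightarrow (n_1, \top^{g}_{J}(w_1)) \\
 & \text{and } (p_2, w_2 u, J, g) \Rightarrow (n_2, \bot) \\[1ex]
\textbf{Lookahead} & \\
(!p, wu, I, f) \Rightarrow (n + 1, \top^{f}_{I}(\varepsilon)) & \text{if } (p, wu, I, f) \Rightarrow (n, \bot) \\
(!p, wu, I, f) \Rightarrow (n + 1, \bot) & \text{if } (p, wu, I, f) \Rightarrow (n, \top^{g}_{J}(w)) \\[1ex]
\textbf{Choice} & \\
(p_1 \,\langle|\rangle\, p_2, wu, I, f) \Rightarrow (n_1 + 1, \top^{g}_{J}(w)) & \text{if } (p_1, wu, I, f) \Rightarrow (n_1, \top^{g}_{J}(w)) \\
(p_1 \,\langle|\rangle\, p_2, wu, I, f) \Rightarrow (n_2 + 1, o) & \text{if } (p_1, wu, I, f) \Rightarrow (n_1, \bot) \\
 & \text{and } (p_2, wu, I, f) \Rightarrow (n_2, o) \\[1ex]
\textbf{Repetition} & \\
(p^{*}, w_1 w_2 u, I, f) \Rightarrow (n_1 + n_2 + 1, \top^{h}_{K}(w_1 w_2)) & \text{if } (p, w_1 w_2 u, I, f) \Rightarrow (n_1, \top^{g}_{J}(w_1)) \\
 & \text{and } (p^{*}, w_2 u, J, g) \Rightarrow (n_2, \top^{h}_{K}(w_2)) \\
(p^{*}, w_1 w_2 u, I, f) \Rightarrow (n + 1, \top^{f}_{I}(\varepsilon)) & \text{if } (p, w_1 w_2 u, I, f) \Rightarrow (n, \bot) \\[1ex]
\textbf{Indentation} & \\
(p^{\triangleright}, wu, I, \nparallel) \Rightarrow (n + 1, \top^{f}_{I'}(w)) & \text{if } (p, wu, J, \nparallel) \Rightarrow (n, \top^{f}_{J'}(w)) \\
 & \text{where } J = \{ j \mid j \in \mathbb{N}, \exists i \in I, j \triangleright i \} \\
 & \phantom{\text{where }} I' = \{ i \mid i \in I, \exists j \in J', j \triangleright i \} \\
(p^{\triangleright}, wu, I, \nparallel) \Rightarrow (n + 1, \bot) & \text{if } (p, wu, J, \nparallel) \Rightarrow (n, \bot) \\
 & \text{where } J = \{ j \mid j \in \mathbb{N}, \exists i \in I, j \triangleright i \} \\
(p^{\triangleright}, wu, I, \parallel) \Rightarrow (n + 1, o) & \text{if } (p, wu, I, \parallel) \Rightarrow (n, o) \\[1ex]
\textbf{Absolute alignment} & \\
(|p|, wu, I, f) \Rightarrow (n + 1, o) & \text{if } (p, wu, I, \parallel) \Rightarrow (n, o)
\end{array}
\]


Figure 6. Indentation-sensitive semantics of PEG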


instead of passing a single indentation, we must pass a set of
allowable indentations. In our example, since the case expression
is at indentation 1, the body of iAlts is passed the set $\{2, 3, 4, \ldots\}$
as the allowable indentations.

However, this is still not enough. Consider, for example, the
parsing expression $\texttt{'a'} ; (\texttt{'b'}^{>} \;\langle|\rangle\; \varepsilon)$. If a occurs at indentation $i$
in the input, then b must be allowed at only indentations strictly
greater than $i$. This is even though 'a' does not contain 'b' and
merely occurs sequentially earlier in the parsing expression.

Further, since PEG uses a biased choice, we must use the right-hand
side of ’b’$^{>}$ $\langle|\rangle$ $\varepsilon$ only if it is impossible to parse using its
left-hand side. However, whether ’b’$^{>}$ succeeds or not is entirely
dependent on the indentation at which ’a’ succeeds. For example,
on the input word $a^{1} b^{2}$, the parser for ’a’ succeeds at 1, and thus
’b’ can be attempted at any indentation greater than 1. Since 2 is
in that range, the parser for ’b’ succeeds, and $\varepsilon$ is never called.
However, with the input word $a^{3} b^{2}$, the a token is at indentation 3,
which restricts the allowed indentations for ’b’ to $\{4, 5, 6, \ldots\}$.
Thus the parser for ’b’ fails, and $\varepsilon$ is used.

In other words, since choices are biased, parses earlier in the 
input affect whether the left-hand side of a choice succeeds and 
thus whether the right-hand side should even be attempted. Thus 
indentation sets must be passed as both input and output in order
to both control the indentations at which a parse is attempted and
report the indentations at which it succeeds.

In addition to handling indentation relations, we must also handle
the |p| operator. This can be achieved by passing a flag to each
parser indicating whether we are inside a |p| that has not yet
consumed a token. If we are, we must not change the current
indentation set and thus ignore any $p^{\triangleright}$ operators.

We formally specify all this by generalizing the PEG rewrite
rules to be a relation from a tuple $(p, w, I, f)$ to a pair $(n, o)$ where
$p$ is a parsing expression, $w$ is an input word, $I \subseteq \mathbb{N}$ is an input
indentation set, $f \in \{\parallel, \nparallel\}$ is an absolute-alignment flag, $n$ is a
step counter, and $o$ is a result. The absolute-alignment flag is $\parallel$ to
indicate that we are inside a |p| that has not yet consumed a token
and $\nparallel$ otherwise. The result $o$ is either a pair of the portion of $w$ that
is consumed by a successful parse along with a result indentation
set $I \subseteq \mathbb{N}$ and flag $f \in \{\parallel, \nparallel\}$ or, in the case of failure, the
distinguished symbol $\bot$. When $o$ is not $\bot$, we write it as $\top^{f}_{I}(w)$
where $w$, $I$, and $f$ are respectively the parsed word, the output
indentation set, and the absolute-alignment flag. Finally, the tokens
in words are all annotated with indentations so $w \in (\Sigma \times \mathbb{N})^{*}$.

The rules from Figure 5 then straightforwardly generalize to the
rules in Figure 6. The empty parsing expression, $\varepsilon$, succeeds on
any input and so returns $I$ and $f$ unchanged. The terminal parsing
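(Derivation tree, read from the bottom up. Each row lists the
judgment for the first alternative on the left and, where present,
the judgment for the second iteration of |alts|* on the right.)

  right: $(\text{alts}, \text{Right}^{4} \cdots, \{5\}, \parallel) \Rightarrow (\cdots, \bot)$

  left:  $(\text{alts}, \text{Left}^{5} \cdots, \{2, 3, 4, \ldots\}, \parallel) \Rightarrow (\cdots, \top^{\nparallel}_{\{5\}}(\cdots))$
  right: $(|\text{alts}|, \text{Right}^{4} \cdots, \{5\}, \nparallel) \Rightarrow (\cdots, \bot)$

  left:  $(|\text{alts}|, \text{Left}^{5} \cdots, \{2, 3, 4, \ldots\}, \nparallel) \Rightarrow (\cdots, \top^{\nparallel}_{\{5\}}(\cdots))$
  right: $(|\text{alts}|^{*}, \text{Right}^{4} \cdots, \{5\}, \nparallel) \Rightarrow (\cdots, \bot)$

  $(|\text{alts}|^{*}, \text{Left}^{5} \cdots, \{2, 3, 4, \ldots\}, \nparallel) \Rightarrow (\cdots, \top^{\nparallel}_{\{5\}}(\cdots))$

  $((|\text{alts}|^{*})^{>}, \text{Left}^{5} \cdots, \{1\}, \nparallel) \Rightarrow (\cdots, \top^{\nparallel}_{\{1\}}(\cdots))$

  $(\text{iAlts}, \text{Left}^{5} \cdots, \{1\}, \nparallel) \Rightarrow (\cdots, \top^{\nparallel}_{\{1\}}(\cdots))$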

Figure 7. Example parse derivation 
Figure 8. Example parse tree


expression, however, succeeds only when i, the indentation of the 
consumed token, is in the set of allowed indentations. Then, as 
a token has now been consumed, it clears the flag. In that case, 
it returns the singleton $\{i\}$ as the only indentation at which it
succeeds. In all other cases, it fails. 

The sequencing operator just threads the indentation set and flag 
through both $p_1$ and $p_2$. Lookahead is similar and just passes the
indentation set and flag through unchanged. The choice operator 
passes the same indentation set and flag to both parsers. 

The interesting cases here are the newly added operators for
indentation, $p^{\triangleright}$, and absolute alignment, |p|. The indentation
operator runs the parsing expression $p$ with a new indentation set $J$
computed according to $\triangleright$ and $I$. Specifically, every element of $J$ is
related by $\triangleright$ to some element of $I$. For example, if we have $p^{>}$ with
$I = \{1, 2\}$, then $J = \{2, 3, 4, \ldots\}$. Once the parsing of $p$
completes, the indentations at which it succeeded, $J'$, are compared to
the original indentation set, $I$, to see which elements of $I$ are
compatible according to $\triangleright$. Those elements of $I$ are then returned in the
output indentation set, $I'$.
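
As a small worked instance of these two set computations (a sketch
that takes the relation to be $>$ and assumes, purely for
illustration, that $p$ succeeds only at indentation 2, so $J' = \{2\}$):

```latex
\begin{aligned}
I  &= \{1, 2\} \\
J  &= \{\, j \mid j \in \mathbb{N},\ \exists i \in I,\ j > i \,\} = \{2, 3, 4, \ldots\} \\
J' &= \{2\} \\
I' &= \{\, i \mid i \in I,\ \exists j \in J',\ j > i \,\} = \{1\}
\end{aligned}
```

Here the indentation 2 is dropped from $I'$ because no element of $J'$
is strictly greater than it, so only the parent indentation 1 remains
compatible with where $p$ actually succeeded.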

An exception to this is when we are parsing in absolute mode.
That is to say, when $f$ is $\parallel$. In that case, the parent and child
must have identical indentations despite the $p^{\triangleright}$ operator. Thus, the
indentation set does not change, and the $p^{\triangleright}$ is effectively ignored.

Finally, the |p| operator is trivial and merely sets the flag to $\parallel$.

4.3 Example Derivation 

As an example of this semantics, consider parsing the following 
Haskell code with the productions in Figure [?] 

f x =
  case x of
    Left _ -> id
   Right

Because case occurs at column 3, Left occurs at column 5, and 
Right occurs at column 4, the Right token should not be part of 
the case expression. Thus this code is equivalent to the following. 

f x = (case x of Left _ -> id) Right

When initially parsing the right-hand side of f, the indentation set
and flag will be $\{1\}$ and $\nparallel$. As the parser proceeds, it will consume
the case, x, and of tokens. In the grammar, the terminals for these
are annotated with the $>$ indentation relation, and in the input,
the indentations of these tokens are all greater than 1. Thus, these
tokens are successfully consumed without changing the indentation
set or flag. Once we get to the Left token though, the current
parsing expression will be eAlts $\langle|\rangle$ iAlts. Since the next token
is not {, eAlts will fail and a parse of iAlts will be attempted.

At this point, indentation sensitivity starts to play a role. The
fragment of the parse derivation for this part is shown in Figure 7.
First, iAlts unfolds into (|alts|*)$^{>}$. The $>$ relation means that
we change from using the $\{1\}$ indentation set to the $\{2, 3, 4, \ldots\}$
indentation set. The |alts|* then calls |alts|, which in turn sets
the flag to $\parallel$. With this flag set, intermediate indentation relations
are ignored so the indentation set does not change until we get to
the parsing expression that actually consumes Left. Though the
terminal for consuming this token will be wrapped with the $>$
relation as explained in Section 3.1.2, this will be ignored as the
flag is $\parallel$ at that point. Thus, when consuming the Left token, the
indentation set is $\{2, 3, 4, \ldots\}$. Since the indentation of Left (i.e.,
5) is in that set, the token is successfully consumed. The flag is then
set to $\nparallel$, and the indentation set becomes $\{5\}$. This indentation set
is used when parsing the remainder of the clause. Since terminals
are wrapped by the $>$ relation, this means that each token in that
clause is allowed at any column in the set $\{ j \mid i \in \{5\}, j > i \} =
\{6, 7, 8, \ldots\}$. This distinction between the first token of |alts|
(which must have an indentation equal to the indentation of |alts|
data IndentationRel = Eq | Ge | Gt | Any

localIndentation :: IndentationRel
  -> ParsecT (IndentStream s) u m a
  -> ParsecT (IndentStream s) u m a

absoluteIndentation ::
     ParsecT (IndentStream s) u m a
  -> ParsecT (IndentStream s) u m a

localTokenMode :: IndentationRel
  -> ParsecT (IndentStream s) u m a
  -> ParsecT (IndentStream s) u m a


Figure 9. Parsec combinators for indentation sensitivity 


itself) and the other tokens of |alts| (which must have indentations
greater than the indentation of |alts|) allows us to handle the
distinction that Haskell makes between tokens at an indentation
equal to the current indentation (which start a new clause) and
tokens at a greater indentation (which continue the current clause).

In Figure 7, once the remainder of that alts is parsed, the
indentation set $\{5\}$ is threaded back out through |alts| to |alts|*.
The indentation set and flag are then used in the second branch
of |alts|* where the process proceeds as it did before. This time,
however, the next token (i.e., Right) is at indentation 4, which is
not an element of the indentation set $\{5\}$. Thus that token cannot
be consumed, and the result is $\bot$. This causes the case expression
to stop at this point and leaves the Right token for a surrounding
function application to consume.
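
The derivation above can be replayed with a toy interpreter for the
rules of Figure 6. The following is only a sketch, not the library's
code: all names (ISet, Expr, run, child, parent, tok, alt) are
hypothetical, indentation sets are assumed to be non-empty intervals
with an optional upper bound, only the relations =, >=, > and "any"
are supported, and the clause grammar is reduced to the four tokens of
the example.

```haskell
-- Toy interpreter for the indentation-sensitive PEG rules (sketch only).
data Rel = Eq | Ge | Gt | Any deriving (Eq, Show)

-- Inclusive interval lo..hi; an upper bound of Nothing means infinity.
data ISet = ISet Int (Maybe Int) deriving (Eq, Show)

data Expr
  = Eps            -- empty string
  | Tok String     -- terminal
  | Seq Expr Expr  -- p1; p2
  | Not Expr       -- !p
  | Alt Expr Expr  -- biased choice
  | Star Expr      -- p*
  | Ind Rel Expr   -- p annotated with an indentation relation
  | Abs Expr       -- |p|

type Token  = (String, Int)               -- token text and indentation
type Result = Maybe ([Token], ISet, Bool) -- rest of input, output set, flag

member :: Int -> ISet -> Bool
member k (ISet lo hi) = lo <= k && maybe True (k <=) hi

-- J = { j | exists i in I, j `rel` i }
child :: Rel -> ISet -> ISet
child Eq  i           = i
child Ge  (ISet lo _) = ISet lo Nothing
child Gt  (ISet lo _) = ISet (lo + 1) Nothing
child Any _           = ISet 0 Nothing

-- I' = { i | i in I, exists j in J', j `rel` i }
parent :: Rel -> ISet -> ISet -> ISet
parent Eq  (ISet lo hi) (ISet lo' hi') = ISet (max lo lo') (minHi hi hi')
parent Ge  (ISet lo hi) (ISet _ hi')   = ISet lo (minHi hi hi')
parent Gt  (ISet lo hi) (ISet _ hi')   = ISet lo (minHi hi (fmap (subtract 1) hi'))
parent Any i            _              = i

minHi :: Maybe Int -> Maybe Int -> Maybe Int
minHi Nothing  b        = b
minHi a        Nothing  = a
minHi (Just a) (Just b) = Just (min a b)

-- The Bool flag is True while inside |p| before its first token.
-- As in PEG, Star p diverges if p can succeed without consuming input.
run :: Expr -> [Token] -> ISet -> Bool -> Result
run Eps ts i f = Just (ts, i, f)
run (Tok a) ((b, k) : ts) i _
  | a == b && member k i = Just (ts, ISet k (Just k), False)
run (Tok _) _ _ _ = Nothing
run (Seq p q) ts i f = do
  (ts', j, g) <- run p ts i f
  run q ts' j g
run (Not p) ts i f = case run p ts i f of
  Nothing -> Just (ts, i, f)
  Just _  -> Nothing
run (Alt p q) ts i f = case run p ts i f of
  Nothing -> run q ts i f
  ok      -> ok
run (Star p) ts i f = case run p ts i f of
  Nothing          -> Just (ts, i, f)
  Just (ts', j, g) -> run (Star p) ts' j g
run (Ind _ p) ts i True  = run p ts i True   -- relation ignored in absolute mode
run (Ind r p) ts i False = do
  (ts', j', g) <- run p ts (child r i) False
  Just (ts', parent r i j', g)
run (Abs p) ts i _ = run p ts i True

-- Simplified stand-ins for the productions of the example.
tok :: String -> Expr
tok = Ind Gt . Tok

alt :: Expr
alt = foldl1 Seq [Alt (tok "Left") (tok "Right"), tok "_", tok "->", tok "id"]

iAlts :: Expr
iAlts = Ind Gt (Star (Abs alt))
```

Under these assumptions, running iAlts on the tokens Left at 5, _ at
10, -> at 12, id at 15 and Right at 4, with the initial set ISet 1
(Just 1) and the flag False, consumes the first clause, leaves the
Right token unconsumed, and returns the set ISet 1 (Just 1), matching
the bottom judgment of Figure 7. The column numbers of the middle
tokens are made up for the illustration; only Left at 5 and Right at 4
come from the example.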

The final parse tree for this expression is then as shown in
Figure 8. We can see in this tree how Right could not be a
descendant of (|alts|*)$^{5}$ as their indentations do not relate according to
the relations specified in the grammar.

5. Parsec 

With this formal model, we can now consider how to implement
indentation sensitivity for Parsec. The basic types and operators
that we add to Parsec are shown in Figure 9. The IndentationRel
type represents an indentation relation where Eq is $=$, Ge is $\geq$,
Gt is $>$, and Any is $\circledast$. The expression localIndentation r
p applies the indentation relation r to p and corresponds to $p^{r}$.
Likewise, absoluteIndentation p ensures that the first token
of p is at the current indentation and corresponds to |p|. Finally,
localTokenMode locally sets a default IndentationRel that is
applied to all tokens. This eliminates the need to explicitly annotate
the tokens in most productions.

To see how to implement these operations, first, we examine
how PEG relates to Parsec. Then, we discuss the practical
implementation of the indentation-sensitive semantics in Parsec.

5.1 Parsec Internals 

The semantics of PEG corresponds closely to the behavior of
Parsec. Since this connection is not often made explicit, we now delve
into the details of how Parsec is implemented and show how it
corresponds to the PEG semantics.

Note that we are considering the semantics of PEG and Parsec
and not their implementations. PEG implementations commonly
cache the results of parses in order to ensure a linear bound on
parsing time. Parsec does not do this, and relatively simple
Parsec grammars can take exponential time. Nevertheless, though the
implementation and the run times of these parsers can vary quite
widely, the semantics of these systems correspond.


newtype ParsecT s u m a = ParsecT {
  unParser :: forall b.
       State s u
    -> (a -> State s u -> ParseError -> m b) -- consumed success
    -> (ParseError -> m b)                   -- consumed failure
    -> (a -> State s u -> ParseError -> m b) -- empty success
    -> (ParseError -> m b)                   -- empty failure
    -> m b
}

data State s u = State {
  stateInput :: s,
  statePos   :: SourcePos,
  stateUser  :: u
}


Figure 10. Data types for Parsec 


In Parsec, a parser is represented by an object of type ParsecT.
This type is shown in Figure 10. The s parameter is the type of the
input stream. The u parameter is the type of the user state that is 
threaded through parser computations. The m parameter is the type 
of the underlying monad, and the a parameter is the type of the 
result produced by the parser. 

The State s u parameter to unParser is the input to the
parser. It is similar to the $w$ in a $(p, w) \Rightarrow (n, o)$ rewrite and
contains the input stream in the stateInput field. In addition,
statePos contains the source position, and stateUser contains
user-defined data.

The remaining parameters to unParser are continuations for 
different types of parse result. The continuations of type a -> 
State s u -> ParseError -> m b are for successful parses. 
The parameter a is the object produced by the parse. State s 
u is the new state after consuming input, and ParseError is a 
collection of error messages that are used if the parser later fails. 
On the other hand, the continuations of type ParseError -> m b 
are for failed parses where the ParseError parameter contains the 
error message to be reported to the user. 

These two types of continuations are very similar to the success 
and failure continuations often used to implement backtracking. 
One difference, however, is that there are two each of both sorts 
of continuation. This is because by default Parsec attempts further 
alternatives in a choice operator only if the previous failures did not 
consume any input. For example, consider the parsing expression 
(’a’; ’b’) $\langle|\rangle$ (’a’; ’c’) on the input ac. The parsing expression
’a’; ’b’ will fail but only after consuming the a. Thus in Parsec,
the failure of ’a’; ’b’ is a consumed failure, and the alternative
parsing expression ’a’; ’c’ is not attempted.

Parsec also includes the try operator, which makes a consumed
failure be treated as an empty failure. For example, if we use
(try (’a’; ’b’)) $\langle|\rangle$ (’a’; ’c’) on the same input, then the failure
of ’a’; ’b’ is treated as an empty failure, and the alternative
’a’; ’c’ is attempted.
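
The difference that try makes can be seen in a small direct-style
model of the four outcomes. This is a sketch under simplifying
assumptions, not Parsec's implementation: the types P, Consumed and
Reply and the functions char', andThen, orElse and try' are
hypothetical stand-ins, the input is a plain String, and error
messages are omitted.

```haskell
-- Toy model of consumed/empty success and failure (not Parsec's real types).
data Consumed a = Consumed a | Empty a deriving (Eq, Show)
data Reply a = Ok a String | Error deriving (Eq, Show)

newtype P a = P { runP :: String -> Consumed (Reply a) }

-- Match one character; a mismatch fails without consuming input.
char' :: Char -> P Char
char' c = P $ \s -> case s of
  (x : xs) | x == c -> Consumed (Ok x xs)
  _                 -> Empty Error

-- Combine the first result with the reply of the second parser.
pair :: a -> Reply b -> Reply (a, b)
pair a (Ok b rest) = Ok (a, b) rest
pair _ Error       = Error

-- Sequencing: the whole counts as consumed if either part consumed input.
andThen :: P a -> P b -> P (a, b)
andThen (P p) (P q) = P $ \s -> case p s of
  Empty Error          -> Empty Error
  Consumed Error       -> Consumed Error
  Empty (Ok a rest)    -> case q rest of
    Empty r    -> Empty (pair a r)
    Consumed r -> Consumed (pair a r)
  Consumed (Ok a rest) -> case q rest of
    Empty r    -> Consumed (pair a r)
    Consumed r -> Consumed (pair a r)

-- Choice: the right-hand side runs only after an empty failure.
orElse :: P a -> P a -> P a
orElse (P p) (P q) = P $ \s -> case p s of
  Empty Error -> q s
  other       -> other

-- try turns a consumed failure into an empty failure.
try' :: P a -> P a
try' (P p) = P $ \s -> case p s of
  Consumed Error -> Empty Error
  other          -> other

ab, ac :: P (Char, Char)
ab = char' 'a' `andThen` char' 'b'
ac = char' 'a' `andThen` char' 'c'

withoutTry, withTry :: Consumed (Reply (Char, Char))
withoutTry = runP (ab `orElse` ac) "ac"       -- consumed failure, ac not tried
withTry    = runP (try' ab `orElse` ac) "ac"  -- ac is tried and succeeds
```

In this model withoutTry evaluates to Consumed Error, while withTry
evaluates to Consumed (Ok ('a','c') ""), mirroring the two cases
described above.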

In the ParsecT type, the second and third arguments to the
unParser function are continuations used for consumed success
or consumed failure, respectively. Likewise, the fourth and fifth
arguments are continuations used for empty success or empty failure,
respectively. For example, the parser for the empty string does not
consume any input and should thus always produce an empty
success. Such a parser is easily implemented as follows, where a is the
object to be returned by the parser, and e is an appropriately defined
ParseError.

parserReturn a =
  ParsecT $ \s _ _ eOk _ -> eOk a s e
data Consumed a = Consumed a
                | Empty a

data Reply s u a = Ok a (State s u) ParseError
                 | Error ParseError


Figure 11. Data types for Parsec parse results


type Indentation = Int

infInd = maxBound :: Indentation

data IStream s = IStream {
  iState      :: IState,
  tokenStream :: s
}

data IState = IState {
  minInd   :: Indentation,
  maxInd   :: Indentation,
  absMode  :: Bool,
  tokenRel :: IndentationRel
}


Figure 12. Data types for indentation sensitivity


class (Monad m) => Stream s m t | s -> t where
  uncons :: s -> m (Maybe (t, s))


Figure 13. Code for the Stream class



This parser simply calls eOk, which is the continuation for empty 
success. 

On the other hand, the parser for a character c consumes input
and is implemented as follows, where e1 and e2 are appropriately
defined ParseError objects.

parseChar c = ParsecT $ \s cOk _ _ eErr ->
  case stateInput s of
    (x : xs) | x == c ->
      cOk x (s { stateInput = xs }) e1
    _ -> eErr e2

This parser checks the input s to see if the next character matches c. 
If it does, cOk, the consumed success continuation, is called with an 
updated State. Otherwise, eErr, the empty failure continuation, is 
called. 

The continuation passing style of ParsecT can be difficult to
reason about, but we can convert it to direct style where it returns
an object with different constructors for different kinds of results.
Parsec provides such an alternate representation using the types in
Figure 11. Thus, the ParsecT type is equivalent to a function from
State s u to m (Consumed (Reply s u a)).
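
The conversion from continuation-passing style to direct style can be
sketched on a stripped-down version of these types. The code below is
an illustration only: the Parser type, the simplified Reply, and
toDirect are hypothetical, the monad m and the user state are dropped,
the input is a plain String, and a String stands in for ParseError.

```haskell
{-# LANGUAGE RankNTypes #-}

-- Simplified result types in the spirit of Figure 11.
data Consumed a = Consumed a | Empty a deriving (Eq, Show)
data Reply a = Ok a String | Error String deriving (Eq, Show)

-- Simplified continuation-passing parser in the spirit of Figure 10.
newtype Parser a = Parser
  { unParser :: forall b.
         String                 -- remaining input
      -> (a -> String -> b)     -- consumed success
      -> (String -> b)          -- consumed failure
      -> (a -> String -> b)     -- empty success
      -> (String -> b)          -- empty failure
      -> b
  }

-- Always succeeds without consuming input.
parserReturn :: a -> Parser a
parserReturn a = Parser (\s _ _ eOk _ -> eOk a s)

-- Consumes one matching character or fails without consuming.
parseChar :: Char -> Parser Char
parseChar c = Parser (\s cOk _ _ eErr ->
  case s of
    (x : xs) | x == c -> cOk x xs
    _                 -> eErr ("expected " ++ show c))

-- Direct style: pass continuations that build the result constructors.
toDirect :: Parser a -> String -> Consumed (Reply a)
toDirect p s =
  unParser p s
    (\a rest -> Consumed (Ok a rest))
    (Consumed . Error)
    (\a rest -> Empty (Ok a rest))
    (Empty . Error)
```

For instance, toDirect (parseChar 'a') "abc" yields Consumed (Ok 'a'
"bc"), whereas toDirect (parserReturn 'x') "abc" yields Empty (Ok 'x'
"abc"), so the choice of continuation shows up as the outer
constructor.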

Represented in these terms, the correspondence between PEG
and Parsec is straightforward. The Parsec parser contains extra
information that is not present in PEG such as the SourcePosition
and user state stored in the State, whether a parser consumes
input or not, the monad m, and the result value of type a. However,
if we elide this extra data, then a Parsec parser is simply a
function from an input word stored in the State to either a successful
or failed parse stored in Reply. This corresponds to a PEG rewrite
$(p, w) \Rightarrow (n, o)$ from an input word, $w$, to either a successful or
failed result, $o$.²

5.2 Indentation Sensitivity 

Given the correspondence between PEG and Parsec, we can now
implement indentation sensitivity in Parsec. The primary challenge
here is the representation of the indentation set, $I$. Since this set
may be infinitely large (such as at the start of $p$ in $p^{>}$), we need to
find an efficient, finite way to represent it. Fortunately, the
following theorem allows us to construct just such a representation.


2 There is still a difference in that a Parsec Reply stores the remaining input 
whereas in PEG o contains the consumed input, but these are equivalent in 
this context. 


instance (Stream s m (t, Indentation)) =>
         Stream (IStream s) m t where
  uncons (IStream is s) = do
    x <- uncons s
    case x of
      Nothing -> return Nothing
      Just ((t, i), s') ->
        return $ updateIndentation is i ok err
        where ok is' = Just (t, IStream is' s')
              err = Nothing


Figure 14. Code for IStream and its Stream instance 


Theorem 1. When parsing a parsing expression $p$ that uses
indentation relations only from the set $\{=, >, \geq, \circledast\}$, all of the
intermediate indentation sets are of the form $\{ j \mid j \in \mathbb{N}, i \leq j < k \}$ for
some $i \in \mathbb{N}$ and $k \in \mathbb{N} \cup \{\infty\}$ provided the initial indentation set
passed to $p$ is also of that form.

Proof. By induction over p and the step counter n. □ 
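
The key inductive step, that the two set computations of the
indentation operator map intervals to intervals, can be sanity-checked
by brute force on a small finite universe. This is only an
illustrative check, not a proof: the names holds, childSet, parentSet
and intervalsPreserved are hypothetical, and the natural numbers are
truncated to 0..12.

```haskell
-- Brute-force check that intervals are preserved (finite sanity check only).
data Rel = Eq | Ge | Gt | Any deriving (Show, Enum, Bounded)

-- holds r j i  means  j `r` i  (child indentation j, parent indentation i).
holds :: Rel -> Int -> Int -> Bool
holds Eq  j i = j == i
holds Ge  j i = j >= i
holds Gt  j i = j > i
holds Any _ _ = True

-- Truncated stand-in for the natural numbers.
universe :: [Int]
universe = [0 .. 12]

-- J = { j | exists i in I, j `r` i }, computed directly from the definition.
childSet :: Rel -> [Int] -> [Int]
childSet r is = [j | j <- universe, any (holds r j) is]

-- I' = { i | i in I, exists j in J', j `r` i }
parentSet :: Rel -> [Int] -> [Int] -> [Int]
parentSet r is js = [i | i <- is, any (\j -> holds r j i) js]

-- A sorted list is an interval if it has no gaps.
isInterval :: [Int] -> Bool
isInterval [] = True
isInterval xs = xs == [head xs .. last xs]

-- All non-empty sub-intervals whose end points are drawn from xs.
intervals :: [Int] -> [[Int]]
intervals xs = [[lo .. hi] | lo <- xs, hi <- xs, lo <= hi]

-- For every relation and every interval I: J is an interval, and for
-- every sub-interval J' of J the resulting I' is an interval as well.
intervalsPreserved :: Bool
intervalsPreserved = and
  [ isInterval js && all (isInterval . parentSet r is) (intervals js)
  | r  <- [minBound .. maxBound]
  , is <- intervals universe
  , let js = childSet r is
  ]
```

The check relies on each relation being monotone in the sense that
the admissible children (or parents) of an interval form a prefix, a
suffix, or the interval itself; an arbitrary relation would not have
this property.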

As a result of this theorem, each indentation set can be
represented by a simple lower and upper bound. This leads to the
IState type defined in Figure 12, which we thread through the
parsing process to keep track of all the state needed for
indentation sensitivity. The minInd and maxInd fields of IState
represent the lower and upper bounds, respectively. The infInd
constant represents when maxInd is infinite. The absMode field is used
to keep track of whether we are in absolute alignment mode. It is
True when the flag $f$ would be $\parallel$ and False when it would be
$\nparallel$. The tokenRel field stores a default indentation relation that
surrounds all terminals. For example, in Haskell, most terminals
are annotated with $>$ in the grammar. Since requiring the user to
annotate every terminal with an indentation relation would be
tedious and error prone, we can instead set tokenRel to Gt.
Implementing the localIndentation, absoluteIndentation, and
localTokenMode operators is then a simple matter of each
operator modifying the IState according to the semantics in Figure 6.

The final consideration is how to thread this IState through 
the parsing process and update it when a token is consumed. The 
design of Parsec restricts the number of ways we can do this. The 
type ParsecT is parameterized by the type of the input stream, s, 
the type of the user state, u, the type of the underlying monad, m, 
and the result type, a. We could store an IState in the user state, 
u, and require the user to call some library function at the start of 
every token that then updates the IState. However, that would be 
a tedious and error prone process. On the other hand, for parsers 
that use Parsec’s LanguageDef abstraction, adding the check to 
the lexeme combinator would handle many cases, but even then, 
many primitive operators such as char, digit, and satisfy do 
not use lexeme so we would have to be careful to also add checks 
to such primitives. 
A more robust solution is to update the IState every time
Parsec reads a token from the input. Parsec reads tokens using
the uncons operation of the Stream class shown in Figure 13.
Unfortunately, within this class we do not have access to the user
state, u, and thus cannot store the IState there. We must store
the IState in either the stream, s, or the monad, m. Normally,
the monad would be the natural place to store it. However, the
choice operator, $\langle|\rangle$, in Parsec does not reset the monad when the
left-hand side fails. Thus any changes to the state made by the
left-hand side would be seen in the parser for the right-hand side.
This is not what we want. The IState used in the right-hand
side should be the original one before any changes were made
by the left-hand side. The Stream, s, is the only place where
we can store the IState. Thus in Figure 14, we define a new
stream type, IStream, that takes a stream of tokens paired with
indentations and calls updateIndentation whenever a token is
read by uncons. Given the current IState, is, the indentation of
the current token, i, and success and failure continuations, ok and
err, updateIndentation computes whether i is in the current
indentation set. If it is, updateIndentation calls ok with a new
IState, is', that is updated according to the semantic rule for
terminals from Figure 6. Otherwise, it calls err. This ensures that
updateIndentation is called for every terminal and properly
backtracks for operators such as $\langle|\rangle$.
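
One way such an update could look is sketched below. It is derived by
composing the terminal rule of Figure 6 with the indentation rule for
the default token relation, using the interval representation of
Figure 12; it is an assumption-laden illustration and not necessarily
the code used in the library. In particular, the failure continuation
here takes no argument, and maxInd is treated as an inclusive bound.

```haskell
type Indentation = Int

data IndentationRel = Eq | Ge | Gt | Any deriving (Eq, Show)

data IState = IState
  { minInd   :: Indentation
  , maxInd   :: Indentation     -- inclusive; infInd stands for infinity
  , absMode  :: Bool
  , tokenRel :: IndentationRel
  } deriving (Eq, Show)

infInd :: Indentation
infInd = maxBound

-- Check a token at indentation i against the current state. On success,
-- narrow the bounds to the parent indentations still compatible with i
-- and clear the absolute-alignment flag.
updateIndentation :: IState -> Indentation -> (IState -> a) -> a -> a
updateIndentation is i ok err =
  case rel of
    Eq  | lo <= i && i <= hi -> done i i
    Ge  | lo <= i            -> done lo (min hi i)
    Gt  | lo <  i            -> done lo (min hi (i - 1))
    Any                      -> done lo hi
    _                        -> err
  where
    lo  = minInd is
    hi  = maxInd is
    -- In absolute mode the token relation is ignored, so the token
    -- must sit exactly at one of the current indentations.
    rel = if absMode is then Eq else tokenRel is
    done lo' hi' = ok is { minInd = lo', maxInd = hi', absMode = False }
```

With ok = Just and err = Nothing, a state IState 2 infInd True Gt and a
token at indentation 5 gives Just (IState 5 5 False Gt), which
corresponds to consuming Left in the example derivation, while IState
5 5 True Gt and a token at indentation 4 gives Nothing, which
corresponds to rejecting Right.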

Due to limitations of the Parsec interface, storing the IState
here does have a significant drawback, however. In uncons there
is no way to signal a parse error except by returning Nothing.
Signaling some sort of error in the monad, m, will not work. Since
m is the monad inside ParsecT and not the ParsecT monad itself,
the error will not be caught by combinators such as $\langle|\rangle$ that should
try alternatives when an indentation check fails.

Returning Nothing achieves the desired integration with the
Parsec combinators, but it is not an ideal solution as that is also the
signal for the end of a Stream. Since invalid indentation and input
exhaustion are conflated, a parse could appear to finish and
consume all of its input when it has merely met an invalidly indented
token. Another problem is that if a parse fails due to an invalid
indentation, the error message will be one for input exhaustion
instead of one for an indentation violation. To remedy this problem,
it is important to run localTokenMode (const Any) eof at the
end of the parse to detect this situation and report an appropriate
error message.

Alternative solutions would be to have the user insert explicit 
indentation checks or change the design of Parsec to allow uncons 
to signal errors other than input exhaustion. The latter option would 
require changes to Parsec as a whole but would make Parsec more 
flexible and is relatively straightforward. 


6. Benchmarks 

In order to test the practicality of this implementation of
indentation sensitivity on a real-world language, we converted the
Idris 0.9.8 compiler to use our parsing library. While a Haskell
compiler would have been a natural choice, in order to get a
meaningful performance comparison, we needed to modify a language
implementation that was already based on Parsec. The only Haskell
implementation we found that does this is Helium, but Helium supports
only a subset of Haskell forms. After considering several options,
we chose Idris as its parser is based on Parsec and uses syntax and
layout rules similar to those of Haskell.³


3 More recent versions of Idris use Trifecta instead of Parsec. We have 
successfully ported our implementation to also work with Trifecta and used 
the resulting library to parse Idris code. However, that port is still in its 
infancy, and we do not have benchmark results for it yet. 



6.1 Implementation 

Porting Idris to use our library was straightforward. The changes
mainly consisted of replacing the ad hoc indentation operators in
the original Idris parser with our own combinators. Since our
combinators are at a higher level of abstraction, this significantly
simplified the parts of the Idris parser relating to indentation. In the
core Idris grammar, approximately two hundred lines are dedicated
to indentation. Those were replaced with half that many lines in our
new system. In addition, this conversion fixed some rather
significant bugs in how Idris’s parser handles indentation. We describe
these bugs in Section 6.3.

6.2 Testing 

In order to test the performance of our parser, we tested it on Idris 
programs collected from a number of sources. These include: 

- the Idris 0.9.8 standard library (Brady 2013e);

- the Idris 0.9.8 demos (Brady 2013c);

- the Idris-dev examples, benchmarks, and tests (Brady 2013d);

- the IdrisWeb web framework (Fowler 2013);

- the WS-idr interpreter (Brady 2013b);

- the bitstreams library (Saunders 2013); and

- the lightyear parsing library (Tejiscak 2013).

First, we tested that our parser produced the same abstract syntax
trees as the original parser. In a few cases, it did not, but when
we investigated, we found that these were all due to bugs in the
implementation of indentation in the original Idris parser. In all
other cases, we produced the same results as the original Idris
parser.

Next, we benchmarked both parsers using Criterion (O’Sullivan
2012). The benchmarks were compiled with GHC 7.6.3 and the
-O compilation flag. They were run on a 1.7GHz Intel Core i7
with 6GB of RAM running Linux 3.11.10. The results of our
benchmarks are shown in Figure 15. For each parsed file, we plot
the parse time of our new parser relative to Idris’s original parser.
Our parser ranged from 1.67 to 2.65 times slower than the original
parser and averaged 1.95 times slower.

6.3 Analysis 

Figure 16. Benchmark results with modified indentation checks


One of the reasons our parser is slower is that, like Idris’s original
parser, we are scannerless. Thus, uncons checks the indentation
of every single character of input. This is unlike Idris’s original
parser, which checks the indentation at only certain manually-chosen
points. As a result, however, the original parser has some
significant bugs in how it handles indentation. In fact, we found
several examples of Idris code that were erroneously parsed by the
original parser. For example, in IdrisWeb we found the following:

expr = do t <- term
          do symbol "+"
             e <- expr
             pure $ t + e
       `mplus` pure t

In this example, mplus occurs at an indentation that should cause
it to be parsed as being outside both do expressions. The original
Idris parser, however, does not check the indentation of the mplus
lexeme. As a result, mplus is parsed as part of the last statement in
the inner do expression. Since our new parser checks the
indentation of every character, it does not have this problem.

In order to determine how much of the performance difference 
is due to this difference in where checks occur, we modified the 
original Idris parser to check all lexemes and our parser to check 
once per lexeme instead of once per character. We then reran the 
benchmarks. The Idris parser took on average 1.50 times what it did 
before, and our parser took on average 0.82 times what it did before. 
The relative performance of these modified parsers is plotted in 
Figure 16. Similar to before, for each parsed file, we plot the parse
time of our parser relative to the Idris parser. Our parser ranged 
from 0.77 to 1.48 times slower than the original parser and averaged 
1.07 times slower. Thus, once we account for the differences in 
where indentation checks occur, the performance of our parser is 
on par with that of Idris’s parser. 
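
As a rough consistency check on these averages (a back-of-the-envelope
sketch: it treats the reported per-file averages as if they composed
multiplicatively, which is only approximately true), scaling the
original 1.95 slowdown by the two adjustment factors gives:

```latex
1.95 \times \frac{0.82}{1.50} \approx 1.07
```

This agrees with the measured average of 1.07 for the modified
parsers.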

7. Related Work 

As discussed in Section 3.1.1, due to circularities introduced by
parse-error (x), the parsing technique that uses an L function
as described in the Haskell language specification (Marlow (ed.)
2010) is generally not used by practical implementations. Instead,
GHC and Hugs use shared state to coordinate between the lexer
and parser. This relies on “some mildly complicated interactions
between the lexer and parser” (Jones 1994) and is tricky to use.
The resulting code is difficult to reason about, and minor changes to
error propagation in the parser can affect parse results. Even worse,
this technique embeds assumptions about the L function and does
not easily generalize to other indentation rules.


The uulib parser library (Swierstra 2011) implements
indentation using a similar approach, but it uses some intricate code
involving continuations to handle the circularity between the lexer and
parser. Like the previous approach, this is hard coded to
Haskell-style indentation and cannot easily handle other layout rules.

The indents (Anklesaria 2012) library is an extension to
Parsec that provides a combinator to store the current position
in a monad for later reference. It then provides combinators to
check that the current position is on the same line, the same
column, or a greater column than that reference position. The
indentparser (Kurur 2012) library is similar but abstracts over
the type of the reference position. This allows more information to
be stored than in indents at the cost of defining extra data types.
In both systems, the user must explicitly insert indentation checks
in their code. The resulting code has a much more operational feel
than in our system. In addition, since these checks are added at
only certain key points, the sorts of bugs discussed in Section 6.3
can easily arise. To the best of our knowledge there is no
published, formal theory for the sort of indentation that these libraries
implement.

Hutton (1992) describes an approach to parsing
indentation-sensitive languages that is based on filtering the token stream. This
idea is further developed by Hutton and Meijer (1996). In both
cases, the layout combinator searches the token stream for
appropriately indented tokens and passes only those tokens to the
combinator for the expression to which the layout rule applies. As each
use of layout scans the remaining tokens in the input, this can lead
to quadratic running time. Given that the layout combinator filters
tokens before parsing occurs, this technique also cannot support
subexpressions, such as parenthesized expressions in Python, that
are exempt from layout constraints. Thus, this approach is
incapable of expressing many real-world languages including ISWIM,
Haskell, Idris, and Python.

Erdweg et al. (2012) propose a method of parsing
indentation-sensitive languages by effectively filtering the parse trees generated
by a GLR parser. The GLR parser generates all possible parse trees 
irrespective of layout. Indentation constraints on each parse node 
then remove the trees that violate the layout rules. For performance 
reasons, this filtering is interleaved with the execution of the GLR 
parser when possible. 

Our paper is an extension of the work in Adams (2013), but
where that work focused on bottom-up, LR(k) parsing, this paper
considers top-down parsing in Parsec and PEG.

Brunauer and Mühlbacher (2006) take a unique approach to
specifying the indentation-sensitive aspects of a language. They use 
a scannerless grammar that uses individual characters as tokens and 
has non-terminals that take an integer counter as parameter. This 
integer is threaded through the grammar and eventually specifies 
the number of spaces that must occur within certain productions. 
The grammar encodes the indentation rules of the language by 
carefully arranging how this parameter is threaded through the 
grammar and thus how many whitespace characters should occur 
at each point in the grammar. 

While encoding indentation sensitivity this way is formally
precise, it comes at a cost. The YAML specification (Ben-Kiki et al.
2009) uses the approach proposed by Brunauer and Mühlbacher
(2006) and as a result has about a dozen and a half different
non-terminals for various sorts of whitespace and comments. With this
encoding, the grammar cannot use a separate tokenizer and must
be scannerless, each possible occurrence of whitespace must be
explicit in the grammar, and the grammar must carefully track which
non-terminals produce or expect what sorts of whitespace. The
authors of the YAML grammar establish naming conventions for
non-terminals that help manage this, but the result is still a grammar that
is difficult to comprehend and even more difficult to modify.
8. Conclusion

This paper extends previous work on grammatical formalisms for
indentation-sensitive languages to handle the top-down,
combinator-based Parsec parsing framework. The resulting formalism is both
expressive and easy to use. We use the connection between the
semantics of PEG and Parsec to define a formal semantics for
indentation sensitivity in these frameworks. Experiments on an Idris
parser using this formalism show that, due to differences in how
often the indentation is checked, the parser takes about twice as long
as a parser using ad hoc techniques. Once the differences in how
often indentation is checked are eliminated, our technique performs
on par with the ad hoc techniques. The resulting library is available
on Hackage as the indentation package and provides convenient
indentation sensitivity for Parsec-based parsers.